Some reflections on data in the public sector




Tom Moritz 



Internet Archive



London School of Economics 

London 
March 26-27, 2009 





The European Thematic Network on the Digital Public Domain



"Data" 




Clear definitions are good. We should not rely on 



metaphysical "solving" / "power-bringing" words 



"So the universe has always appeared to the natural mind as a kind of enigma, of
which the key must be sought in the shape of some illuminating or power-bringing
word or name. That word names the universe's principle, and to possess it is, after
a fashion, to possess the universe itself.

'God,' 'Matter,' 'Reason,' 'the Absolute,' 'Energy,'

["Knowledge," "Information," "Data" - added]

are so many solving names.

You can rest when you have them. You are at the end of your metaphysical quest."



William James. "What Pragmatism Means", Lecture 2 in Pragmatism: A new name for some old ways of
thinking. New York: Longmans, Green and Co (1922): 52-52.
Internet Archive: http://www.archive.org/stream/pragmatismnewnam00jame

Note date of publication: 1922



Data? 



In common usage "data" refers both

to an electronic medium of exchange (this is the
definition applied by the US NSF "DataNet"
solicitation)

and

disciplinarily / epistemically, to formal,
consistent, conventional expressions of facts
(observations / measurements).

We should be clear how we are using the term.

[BTW in normal usage "data" can be singular or plural...?]



Digital Explosion 



"The digital universe in 2007 - at 2.25 x 10^21 bits (281 exabytes or
281 billion gigabytes) - was 10% bigger than we thought. The
resizing comes as a result of faster growth in cameras, digital TV
shipments, and better understanding of information replication.

"By 2011, the digital universe will be 10 times the size it was in
2006.

"As forecast, the amount of information created, captured, or
replicated exceeded available storage for the first time in 2007.
Not all information created and transmitted gets stored, but by
2011, almost half of the digital universe will not have a permanent
home."



The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide 
Information Growth through 2011. An IDC Whitepaper 
www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf
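
The unit conversions in the IDC figures can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, which is what the quoted numbers appear to use, and assumes constant year-on-year growth between 2006 and 2011; that growth model is a simplification introduced here, not part of the IDC forecast.

```python
# Decimal (SI) units are assumed throughout.
BITS_PER_BYTE = 8
EXABYTE = 10**18    # bytes
GIGABYTE = 10**9    # bytes


def bits_to_bytes(bits):
    """Convert a bit count to bytes."""
    return bits / BITS_PER_BYTE


def implied_annual_growth(factor, years):
    """Constant yearly multiplier that gives `factor` growth over `years`."""
    return factor ** (1.0 / years)


universe_2007_bits = 2.25e21                       # figure quoted by IDC
universe_2007_bytes = bits_to_bytes(universe_2007_bits)
exabytes = universe_2007_bytes / EXABYTE           # about 281 exabytes
gigabytes = universe_2007_bytes / GIGABYTE         # about 281 billion gigabytes

# "10 times the size it was in 2006" by 2011, spread evenly over five years
# (an assumption made only for this illustration).
growth = implied_annual_growth(10, 2011 - 2006)

print(f"{exabytes:.2f} EB = {gigabytes:.3e} GB")
print(f"implied growth: roughly {(growth - 1) * 100:.0f}% per year")
```

Under these assumptions the three IDC figures agree with one another, and tenfold growth over five years corresponds to the universe growing by somewhat more than half again each year.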



Data is now more than ever available 
in highly diverse formats from very disparate sources 





Validation of data and critical awareness and analysis of data sources is essential 






We must address both legacy data and current /
prospective data

Many data sets - to be fully useful - must be
significantly longitudinal: for example
biological taxonomy, but also climate,
oceanography, etc.

Older data sets, while essential, may be much
more problematic

Russian Chronicles of Nature / zapovedniks

US LTER Trout Lake, WI example (Geof Bowker)
California Fish & Game



NCAR Research Data Archive (RDA) 



Period of Record for Atmospheric Soundings in the RDA 
C.A. Jacobs, S. J. Worley, "Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge
sharing," from the 4th International Digital Curation Conference, December 2008, page 7.
www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]



The NCAR Research Data Archive (RDA) 



"The NCAR Research Data Archive (RDA) is a comparatively small
(currently 246 TB, less than 5% of the MSS [Mass storage system] total
size), but very important, part of the MSS stored data. The RDA has
been curated by the staff in the Computational and Information
Systems Laboratory for over 40 years, [emphasis added] and as such
contains reference datasets used by large numbers of scientists.
The RDA contents are long-term atmospheric (surface and upper
air) and oceanographic observations, grid analyses of observational
datasets, operational weather prediction model output, reanalyses,
satellite derived datasets, and ancillary datasets, such as
topography/bathymetry, vegetation, and land use. The RDA is not
a static collection; it is now over 580 datasets with about 100
routinely updated and 10-20 new ones added each year."



C.A. Jacobs, S. J. Worley, "Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge
sharing," from the 4th International Digital Curation Conference, December 2008, page 5.
www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]





In some instances we are working at "peta scale"

This has dramatic implications for future full
life cycle management

quantity becomes quality?



CERN Accelerators 

(not to scale) 



The $3.6 billion Large Hadron Collider
(LHC) will sample and record the
results of up to 600 million proton
collisions per second, producing roughly
15 petabytes (15 million gigabytes) of
data annually in search of new
fundamental particles. To allow
thousands of scientists from around the
globe to collaborate on the analysis of
these data over the next 15 years (the
estimated lifetime of the LHC), tens of
thousands of computers located around
the world are being harnessed in a
distributed computing network called
the Grid. Within the Grid, described as
the most powerful supercomputer
system in the world, the avalanche of
data will be analyzed, shared,
repurposed and combined in innovative
new ways designed to reveal the
secrets of the fundamental properties
of matter.



LHC source:
http://public.web.cern.ch/public/en/LHC/LHC-en.html
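
To give a sense of what "peta scale" means for life cycle management, the quoted LHC figures can be turned into a sustained data rate and a lifetime total. This is only back-of-the-envelope arithmetic: it assumes decimal units and assumes the 15 petabytes are spread evenly over a 365-day year, which the source does not state.

```python
PETABYTE = 10**15                    # bytes, decimal units assumed
MEGABYTE = 10**6                     # bytes
SECONDS_PER_YEAR = 365 * 24 * 3600   # non-leap year, for simplicity

annual_bytes = 15 * PETABYTE         # "roughly 15 petabytes ... annually"
lifetime_years = 15                  # "estimated lifetime of the LHC"

# Average rate if the data arrived evenly all year (an assumption).
avg_rate_mb_s = annual_bytes / SECONDS_PER_YEAR / MEGABYTE

# Total to be curated if the annual volume stayed constant (also an assumption).
lifetime_pb = annual_bytes * lifetime_years / PETABYTE

print(f"average rate: about {avg_rate_mb_s:.0f} MB/s")
print(f"lifetime total: about {lifetime_pb:.0f} PB")
```

Even on these simplified assumptions, the archive would need to absorb several hundred megabytes every second, and preserve a few hundred petabytes over the life of the instrument.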



DATA SETS

some examples with "native metadata"



2-d_soil_temps.csv
surface, and sub-surface soil temperatures (at 2cm and 8cm depths) measured at one location for a few days in order to
calibrate a model of temperature propagation. Surface temperature was measured with an infrared thermometer,
subsurface temperatures with a thermocouple.

5-minute_light_data_for_4_continuous_days_plus_reference.xls
PPF (photosynthetic photon flux = photosynthetically active radiation 400-700nm) measured with an array of photodiodes
calibrated to a Licor sensor, along a linear transect for a few days. used to get an idea of how much light plants along
the transect are receiving.

CO2_of_air_at_different_heights_July_9.xls
concentration of CO2 in the air during the evening for one day, measured with a Licor infrared gas analyzer and a series of
relays and tubes with a pump. used to examine the gradient of CO2 coming from the soil when the air is still during the
evening.

Fern_light_response.xls
Light response curves for bracken ferns, measured with a Licor photosynthesis system. Fronds are exposed to different light
levels and their instantaneous photosynthesis and conductance is measured. used in conjunction with the induction
data (below) for physiological characterization of the ferns.

La_Selva_species_photosyntheis_table.xls
incomplete data set on instantaneous photosynthesis rates for various tropical understory and epiphytic species grown in a
shade house in Costa Rica.

manzanita_sapflow_12-5-07_to_7-7-08.xls
instantaneous sap flow data (as temperature differences on a constant temperature heat dissipation probe) for multiple
branches of Manzanita, collected with a datalogger. used to correlate physiological activity with below-ground
measures of root grown and CO2 production.

moisture release curves.xls
percentage of water content, water potential (in MegaPascals) and temperature of soil samples, measured in the laboratory
for calibration of water content with water potential. soil is from the James Reserve in California.

Photosynthetic_induction.xls
a time-course of photosynthetic induction for a leaf over 35 minutes, instantaneous photosynthesis measured as µmol CO2/m2/s
and light level is probably 1000 micromoles. used to determine physiological characteristics of bracken ferns.

run_2_24-h_data_for_mesh.xls
measurements of micrometeorological parameters on a moving shuttle, going from a clearing across a forest edge and into
the forest for about 30 meters. Pyranometers facing up and down, pyrgeometer facing up and down, PAR, air
temperature, relative humidity. Also data from a station fixed in the clearing and some derived variables calculated.
used for examining edge effects in forests.

Segment_of_wallflower_compare_colorspaces_blur.xls
pixel counts from images of wallflowers that were segmented into flower/not-flower under different color spaces.
segmentation was made using a probability matrix of hand-segmented images. used to automatically count flowers in
images collected after this training data was collected (and used to determine the best color space for this task).



manzanita_sapflow_12-5-07_to_7-7-08.xls

instantaneous sap flow data (as temperature differences on a constant temperature heat
dissipation probe) for multiple branches of Manzanita, collected with a datalogger.

used to correlate physiological activity with below-ground measures of root grown and
CO2 production.

sbid battery datetime heater_voltage Manz1Sap1 Manz1Sap2 Manz1Sap3 Manz1Sap4 Manz2Sap5 Manz2Sap6 Manz2Sap7 Manz3Sap10 Manz3Sap8 Manz3Sap9 Manz4Sap11 timestamp Datagap Julian

From an Excel Spreadsheet
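
A column header like this carries almost no meaning on its own. One way to make the gap visible is to compare the header against a small data dictionary and list the columns nobody has described yet. The sketch below is illustrative only: the `ColumnDoc` type, the `undocumented_columns` helper, and every description and unit in the dictionary are hypothetical, invented here and not taken from the spreadsheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnDoc:
    """Human-readable documentation for one spreadsheet column."""
    description: str
    unit: str  # empty string when dimensionless or not applicable


def undocumented_columns(header, data_dictionary):
    """Return the header columns that have no data-dictionary entry."""
    return [name for name in header if name not in data_dictionary]


# A subset of the column names shown on the slide.
header = ("sbid battery datetime heater_voltage Manz1Sap2 "
          "timestamp Datagap Julian").split()

# Hypothetical entries: these descriptions and units are assumptions
# made for the example, not the data owner's definitions.
data_dictionary = {
    "battery": ColumnDoc("datalogger battery level", "V"),
    "heater_voltage": ColumnDoc("voltage applied to the probe heater", "V"),
    "datetime": ColumnDoc("date and time of the reading", ""),
}

missing = undocumented_columns(header, data_dictionary)
print("columns still lacking documentation:", missing)
```

A check of this kind could be run before a data set is deposited, so that questions such as what "sbid" or "Datagap" mean are asked while the person who knows the answer is still available.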




Access per se does not equal fitness for use 





Provenance or context are essential to give
meaning to data [see February letter to
Science: "Keeping Raw Data in Context"]

Geo-scale is of particular importance for PSI 




Mechanisms for maintaining provenance - 
through combinations and re-combinations of 
data -- are essential 
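
A minimal sketch of what such a mechanism might look like: each combined data set keeps references to the data sets it was built from, so the original sources can still be listed after several rounds of re-combination. The `Dataset` class, the `combine` and `lineage` helpers, and the data set names are all hypothetical; real provenance systems record far more (who, when, which transformation).

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """A data set that remembers which data sets it was derived from."""
    name: str
    records: tuple
    sources: tuple = ()   # parent Dataset objects; empty for original data
    note: str = ""

    def fingerprint(self):
        """SHA-256 over the content and the fingerprints of all parents."""
        payload = json.dumps(
            {
                "name": self.name,
                "records": list(self.records),
                "sources": [s.fingerprint() for s in self.sources],
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def combine(name, *parts, note=""):
    """Concatenate records from `parts`, keeping the parts as provenance."""
    records = tuple(r for part in parts for r in part.records)
    return Dataset(name, records, sources=tuple(parts), note=note)


def lineage(dataset):
    """Names of all original (source-less) data sets behind `dataset`."""
    if not dataset.sources:
        return {dataset.name}
    found = set()
    for source in dataset.sources:
        found |= lineage(source)
    return found


# Hypothetical example data sets.
a = Dataset("museum_specimens", ("rec1", "rec2"))
b = Dataset("field_survey", ("rec3",))
c = Dataset("herbarium_sheets", ("rec4",))

ab = combine("merged_ab", a, b, note="first combination")
abc = combine("merged_abc", ab, c, note="re-combination")

print(sorted(lineage(abc)))
print(abc.fingerprint())
```

Because each fingerprint covers the fingerprints of the parents, a change anywhere upstream changes the identifier of every data set derived from it, which is one simple way to notice that a combined data set no longer matches its stated sources.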




GBIF in Copenhagen has recently formed a
Data Publishing Framework Task Group




GBIF- October, 2008 

(as a result of the Darwin Core reductionist data analysis...) 



GBIF UDDI Registry

* registration

* update information

Data Providers: 259
Datasets: 7481
Searchable Records: 147.539.975



http://www.gbif.org/ [clipped Oct 8, 2008] 







Data does not respect sectors - it is easy to
envision integral data sets drawn from public /
private, for-profit / not-for-profit sectors

At a recent US NAS hearing the Dow
Chemical Company reported that it had several
hundred thousand technical reports in a
proprietary corporate collection...

The greatly extended latency (?) of public access
to this work is a violation of a fundamental
principle of science

We must exert pressure for free / open access
and use in all domains (Wellcome Trust has
been exemplary)



Sakhalin Energy relocates offshore pipelines to protect whales

30/03/2005

"Yuzhno-Sakhalinsk, Russian Federation, 30 March 2005: Sakhalin Energy will reroute
offshore pipelines in its oil and gas
development in the Russian Far East to help
protect the endangered western gray whale."

http://www.shell.com/home/content/media/news_and_library/press_releases/2005/sakhalin_energy_relocates_pipeline_30032005.html



We do not know how data might be used 

and who might use it... 




Evolution and Ecology of the 

Digital Domain 



Stages of Digital Library Development

Stage | Date | Sponsor | Purpose
I: Experimental | 1994 | NSF/ARPA/NASA | Experiments on collections of digital materials
II: Developing | 1998/1999 | NSF/ARPA/NASA, DLF/CLIR | Begin to consider custodianship, sustainability, user communities
III: Mature | ? | Funded through normal channels? | Real sustainable interoperable digital libraries

Adapted from Howard Besser, "The Next Stage: Moving from Isolated Digital Collections to Interoperable Digital Libraries,"
First Monday, volume 7, number 6 (June 2002).
URL: http://firstmonday.org/issues/issue7_6/besser/index.html
Copyright 2000 by Paul F. Uhlir.

The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. Julie M. Esanu
and Paul F. Uhlir, Editors. Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain, Office of
International Scientific and Technical Information Programs, Board on International Scientific Organizations, Policy and Global Affairs Division,
National Research Council of the National Academies, p. 5



References to "Intellectual Property" in U.S. federal cases

1900-1919: 1
1920-1949
1970-1979: 56
1980-1989: 341
1990-1999: 1721

"Professor Hank Greely" Cited in Lessig, L. The future of ideas: the fate of the commons in
a connected world. NY, Random House, 2001. P. 294.



Graph of "The Knowledge Life Cycle"

Julian Birkinshaw and Tony Sheehan, "Managing the Knowledge Life Cycle,"
MIT Sloan Management Review, 44 (2) Fall, 2002: 77.

Shows: "Creation -> Mobilization -> Diffusion -> Commoditization" of knowledge as developmental cycle over
time with "access" increasing significantly to final
"commoditization" stage. . . .

Added annotation: "Is scientific knowledge a commodity?"



Flier from 1941 cartoonists strike at Disney Studios

"Mickey Mouse wears an AFL (American
Federation of Labor) button and carries a
placard that reads 'Disney UNFAIR.' Bottom
edge reads 'Printed by Disney Strikers on Offset
Duplicator. Hand made Stencil.'"

[Metadata: Strikes and lockouts -- Motion
picture industry; Walt Disney Productions;
Disney characters; Mickey Mouse; Motion
picture industry -- Employees -- Labor
unions; American Federation of Labor;
Animators; Brotherhood of Painters,
Decorators, and Paperhangers of America.;
Screen Cartoonists Local Union No. 852
(Hollywood, Calif.); Animation Guild and
Affiliated Optical Electronic and Graphic
Arts, Local 839 I.A.T.S.E. (North Hollywood,
Los Angeles, Calif.); Motion Pictures Screen
Cartoonists Local 839, I.A.T.S.E.]

Cal State Univ Northridge

http://digitallibrary.csun.edu/cdm4/results.php?CISOOP1=any&CISOBOX1=Disney&CISOFIELD1=CISOSEARCHALL&CISOROOT=all&su



Perhaps certain types of "cultural properties"
are inevitably commodities?

Perhaps some cultural objects and works WITH HIGH MARKET
VALUE will inevitably fall into restricted use [art, talkies,
vampire novels...?] - but much work - including orphaned works
and out-of-print work [demonstrably non-commercial] - should be
available for access and use

The groups that sued Google are representative of major 
commercial interests 

The "long tail" case seems convincing but we must consider the 
societal cost-benefit analysis that leads from it to severe 
restrictions on access in exchange for very marginal cost-benefits to 
individual producers 

Perhaps some simple one-time opt-out, opt-in or buy-out? 

Or as Jonas Salk noted the reward is the ability to go on and to do 
more... 






Libraries, archives and museums have - for better 
or worse - long been the accepted repositories for 
human knowledge . . . 

The notion of commercial "corporations" serving
as custodians of knowledge is highly problematic

A problem of mission

Microsoft made a business decision last year (2008) to
stop digital activity

The oldest known human corporation [Japan's Kongo
Gumi, a construction company founded in 578] was
sold and consolidated into another company




Monopolies and cartels are bad - Elsevier 





In 2004, in the Washington Post, Elsevier reported
a 34% profit margin

But they are clever ("smartest guys in the
room"? -- ENRON? AIG? ....Google?)



There are powerful, well-formed
arguments for the contributions of
open access and effective use of
data to the public welfare.



These arguments are drawn from notions of

Human rights / Fairness
Secular democracy
Civic responsibility
The ethos of science
The ethos of conservation
Education / Scientific literacy
Public health

And others...



"If Avian Flu Has Passed Us By Here's Why..." (NYT)

Chart showing global spread of avian flu
together with hemispheric avian migration
routes...

Text added:
"How many data sources contributed to this
analysis...?"



Polemically / politically there is a spectrum of public
welfare that argues that much - perhaps not all? -
data should be released?

OR perhaps all of it should be released???

[Consider the Faustian / Klaus Fuchs / Abdul Qadeer Khan
syndrome - see NY Review of Books, Volume 56, April 9, 2009:
Jeremy Bernstein, "He Changed History"]




Note how many of Thursday's arguments focused on human 
health and welfare- it is the easiest/ most obvious case 



ALL knowledge? Or perhaps, an ethical spectrum?

the polemics of support for the Science Knowledge Commons

Human Health; Agriculture; Science-Tech; [Biotechnology]

Earth Science/Conservation; Education; [Nuclear Technology]



National Institutes of Health FY 2008 Budget
by Funding Mechanism

All other*, $0.5
Other Research, $1.7
Research Training, $0.8
R&D Contracts, $3.0
Intramural Research, $2.7
Res. Centers, $2.9
Small Business Grants, $0.6
RPGs (Research Project Grants), $14.6

* Includes management and support, Library of Medicine, and Office of the Director.

Source: NIH agency budget justification.
FEB. '07 © 2007 AAAS

http://www.aaas.org/spp/rd/fy08.htm



National Science Foundation Budget, FY 2000-2008

(budget authority in billions of constant FY 2007 dollars)

Source: National Science Foundation, and latest AAAS estimates of FY 2006
budget. FY 2008 is budget request; FY 2007 is estimate of final
appropriation.

FEB. '07 REVISED © 2007 AAAS

http://www.aaas.org/spp/rd/fy08.htm




The global community is focusing on full-life
cycle management of data

Particularly including curation and preservation
[bit rot]

• Migration? / Emulation?

• Trusted Digital repositories
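
"Bit rot" is usually detected by fixity checking: a checksum is recorded when an object enters the repository and recomputed later, and any mismatch signals silent corruption. The following is a minimal sketch of that idea using in-memory bytes instead of files; the function names, the object names, and the choice of SHA-256 are assumptions made for illustration, not a description of any particular repository's practice.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def make_manifest(objects):
    """Record a checksum for every stored object at ingest time."""
    return {name: sha256_of(data) for name, data in objects.items()}


def fixity_failures(objects, manifest):
    """Names that are missing or whose bytes no longer match the manifest."""
    return sorted(
        name
        for name, digest in manifest.items()
        if name not in objects or sha256_of(objects[name]) != digest
    )


# Hypothetical stored objects, held in memory to keep the example self-contained.
stored = {
    "soundings_1958.dat": b"pressure,temp\n1000,15.2\n",
    "soil_temps.csv": b"depth_cm,temp_c\n2,18.4\n8,16.9\n",
}
manifest = make_manifest(stored)

# Simulate bit rot by flipping a single bit in one object.
damaged = dict(stored)
corrupted = bytearray(damaged["soil_temps.csv"])
corrupted[0] ^= 0x01
damaged["soil_temps.csv"] = bytes(corrupted)

print(fixity_failures(stored, manifest))    # nothing has changed yet
print(fixity_failures(damaged, manifest))   # the flipped bit is detected
```

A fixity check only detects damage; recovering from it still depends on having independent copies, which is part of what a trusted repository is expected to provide.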





Migration / Emulation ???

Image of an 8" diskette (single side, single density, 32 sectors) with its printed write-protect instructions:

"WRITE PROTECT

This Diskette has the capability of being write protected by the
hole indicated by the arrow below. When the hole is open, it is
protected; when covered, writing is allowed. The hole is closed by
placing a tab over the front of the hole, and folded over covering the
rear of the hole. The Diskette can then be write protected by
removing the tab."




Preservation 





Internet Archive




A focus on broadband alone is not sufficient 




The technology for affordable mass
digitization exists - it should be part of any
economic stimulus effort



Library of Congress scanning center / FedLink
eligibility for US Federal governmental contracts

WayBackMachine / Archive-It/ Internet Archive 

• 150 billion Web pages 

NASA Images Project 



Wednesday, January 21st, 2009 at 12:00 am 
Freedom of Information Act 



MEMORANDUM FOR THE HEADS OF EXECUTIVE DEPARTMENTS AND AGENCIES 



SUBJECT: Freedom of Information Act 



A democracy requires accountability, and accountability requires transparency. As Justice Louis Brandeis 
wrote, "sunlight is said to be the best of disinfectants." In our democracy, the Freedom of Information Act 
(FOIA), which encourages accountability through transparency, is the most prominent expression of a 
profound national commitment to ensuring an open Government. At the heart of that commitment is the 
idea that accountability is in the interest of the Government and the citizenry alike. 



The Freedom of Information Act should be administered with a clear presumption: In the face of doubt, 
openness prevails. The Government should not keep information confidential merely because public 
officials might be embarrassed by disclosure, because errors and failures might be revealed, or because of 
speculative or abstract fears. Nondisclosure should never be based on an effort to protect the personal
interests of Government officials at the expense of those they are supposed to serve. In responding to 
requests under the FOIA, executive branch agencies (agencies) should act promptly and in a spirit of 
cooperation, recognizing that such agencies are servants of the public. 



All agencies should adopt a presumption in favor of disclosure, in order to renew their commitment to 
the principles embodied in FOIA, and to usher in a new era of open Government. The presumption of 
disclosure should be applied to all decisions involving FOIA. 



http://www.whitehouse.gov/the_press_office/Freedom_of_Information_Act/




Electric Current Abroad: Plugs in Commercial Use

Chart of plug types in commercial use (type letters not legible): flat blade attachment plug; flat blades with
round grounding pin; round attachment; "Schuko" plug and receptacle with side grounding contacts; round pin plug and
receptacle with male grounding pin; rectangular blade plug; round pins with ground; oblique flat blades with ground.




http://www.mikero.com/blog/2009/02/20/more-darwin 



http://www. 




Ie.com/darwin2009 



Tom Moritz 



Internet Archive 



+1 310 963 0199

<moritz@archive.org>




